1. Introduction

The New York City Taxi and Limousine Commission (TLC), created in 1971, is the agency responsible for licensing and regulating New York City’s medallion (yellow) taxis, street hail livery (green) taxis, for-hire vehicles (FHVs), commuter vans, and paratransit vehicles. The TLC cooperates with taxi technology providers (now called technology service providers, or TSPs) to collect trip record information for each taxi and FHV trip completed by licensed drivers and vehicles.

Taxi trip data can be acquired from the TLC website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page), where trip records are published by year, month, and vehicle type (yellow/green/FHV/High-volume FHV). Of the four vehicle types, we first narrow our focus to yellow and green taxis: they are the “traditional” taxi types that respond to street hails, and their records come from a more reliable data collection process than FHV trip records, which rely on submissions from companies such as Uber and Lyft.

Comparing the two taxi types, green (or “boro”) taxis see far less use than yellow taxis. This is mainly due to their specific purpose of serving the outer boroughs: they are not allowed to pick up new passengers within the “yellow zone” of Manhattan or at the airports. This has contributed to an 86% plunge in the number of operating green cabs, from 6,500 in 2015 to fewer than 900 in 2023.

The restricted service area of green taxis does not fit our purpose of understanding taxi demand patterns across all of New York City, and their usage is very limited compared to yellow taxis. Therefore, the dataset used for this analysis consists of yellow taxi trip records only.

Reference (decline of green taxi population): (https://www.nbcnewyork.com/news/local/green-cabs-are-being-phased-out-heres-what-will-replace-them/4302496/#:~:text=The%20Taxi%20and%20Limousine%20Commission,%25%20plunge%2C%20The%20City%20reported.)

2. Dataset

2-1. Introduction to Yellow Taxi Trip Data & Cleaning

## # A tibble: 6 × 18
##   vendor_name Trip_Pickup_DateTime Trip_Dropoff_DateTime Passenger_Count
##   <chr>       <chr>                <chr>                           <int>
## 1 VTS         2009-01-04 02:52:00  2009-01-04 03:02:00                 1
## 2 VTS         2009-01-04 03:31:00  2009-01-04 03:38:00                 3
## 3 VTS         2009-01-03 15:43:00  2009-01-03 15:57:00                 5
## 4 DDS         2009-01-01 20:52:58  2009-01-01 21:14:00                 1
## 5 DDS         2009-01-24 16:18:23  2009-01-24 16:24:56                 1
## 6 DDS         2009-01-16 22:35:59  2009-01-16 22:43:35                 2
## # ℹ 14 more variables: Trip_Distance <dbl>, Start_Lon <dbl>, Start_Lat <dbl>,
## #   Rate_Code <dbl>, store_and_forward <dbl>, End_Lon <dbl>, End_Lat <dbl>,
## #   Payment_Type <chr>, Fare_Amt <dbl>, surcharge <dbl>, mta_tax <dbl>,
## #   Tip_Amt <dbl>, Tolls_Amt <dbl>, Total_Amt <dbl>

The raw dataset provided by the TLC consists of 18 columns with largely self-explanatory names. Our key variables are:
- Trip_Pickup_DateTime and Trip_Dropoff_DateTime, representing temporal information;
- the Lon/Lat columns (Start_Lon, Start_Lat, End_Lon, End_Lat), representing spatial information.

The spatial portion of the data, however, has changed substantially in recent releases. The TLC no longer provides the coordinates of pickup/dropoff locations; they have been replaced by location IDs that indicate which “taxi zone” each location falls into. While exact lon/lat values would better serve our purpose, only the records from 2009 and 2010 are available in that format. Since the scope of this analysis is recent trends in taxi demand, we rely on the taxi zone shapefile, also provided by the TLC, to capture the spatial nature of the taxi data.

We use the yellow taxi trip records of August 2024, the most recent data published by the TLC at the time of writing, with a sample size of 1 million trips.

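library(ggplot2)    # plotting
library(ggspatial)  # annotation_map_tile()
library(sf)         # geom_sf() support for the shapefile

# zoneshp: the TLC taxi zone shapefile, loaded as an sf object
ggplot()+
  annotation_map_tile(type = "osm", zoom = 12) +
  geom_sf(data = zoneshp)+
  labs(title = "NYC Taxi Zone Boundaries",
       x = "Longitude", y = "Latitude") +
  theme_minimal()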
## Zoom: 12

The TLC divides New York City into 263 taxi zones, which are referenced by the “PULocationID” and “DOLocationID” columns in the dataset. To provide coordinates for each row for spatial analysis, we calculate the centroid coordinates of each taxi zone and add them to the original dataset:

##        PULocationID DOLocationID PU_Longitude PU_Latitude DO_Longitude
## 308175          237          161    -73.96563    40.76862    -73.97770
## 338694          100          186    -73.98879    40.75351    -73.99244
## 54621           161          114    -73.97770    40.75803    -73.99738
## 36044           100           13    -73.98879    40.75351    -74.01608
## 802089           75           75    -73.94575    40.79001    -73.94575
## 77387           163          162    -73.97757    40.76442    -73.97236
##        DO_Latitude
## 308175    40.75803
## 338694    40.74850
## 54621     40.72834
## 36044     40.71204
## 802089    40.79001
## 77387     40.75669

The six columns shown above provide spatial information; the last four were calculated from the first two, which come from the original data.
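The centroid step described above can be sketched as follows. This is a minimal illustration on two made-up square zones, not the code used for the report: the `square()` helper, the toy `zones` and `trips` objects, and the choice to compute centroids in the projected CRS before converting to lon/lat are all assumptions.

```r
library(sf)

# Hypothetical stand-in for the taxi zone layer: two square zones in
# EPSG:2263 (NAD83 / New York Long Island, ftUS), the CRS shown in the output.
square <- function(x0, y0, side = 2000) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + side, y0), c(x0 + side, y0 + side),
    c(x0, y0 + side), c(x0, y0)
  )))
}
zones <- st_sf(
  LocationID = c(1L, 2L),
  geometry = st_sfc(square(988000, 212000), square(994000, 220000), crs = 2263)
)

# Centroids are computed in the projected CRS, then converted to lon/lat.
centroids <- st_transform(st_centroid(st_geometry(zones)), 4326)
xy <- st_coordinates(centroids)
lookup <- data.frame(
  LocationID = zones$LocationID,
  Longitude  = xy[, "X"],
  Latitude   = xy[, "Y"]
)

# Toy trips; match() finds each trip's zone in the centroid lookup table.
trips <- data.frame(PULocationID = c(2L, 1L, 2L), DOLocationID = c(1L, 1L, 2L))
pu_idx <- match(trips$PULocationID, lookup$LocationID)
do_idx <- match(trips$DOLocationID, lookup$LocationID)
trips$PU_Longitude <- lookup$Longitude[pu_idx]
trips$PU_Latitude  <- lookup$Latitude[pu_idx]
trips$DO_Longitude <- lookup$Longitude[do_idx]
trips$DO_Latitude  <- lookup$Latitude[do_idx]
```

Because every trip in a zone receives the same centroid, trips that start and end in the same zone end up with identical pickup and dropoff coordinates, as in the row with PULocationID 75 above.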

##   tpep_pickup_datetime tpep_dropoff_datetime  Duration
## 1  2024-08-10 16:40:15   2024-08-10 16:58:01 17.766667
## 2  2024-08-01 13:21:49   2024-08-01 13:37:04 15.250000
## 3  2024-08-29 12:22:11   2024-08-29 12:30:48  8.616667
## 4  2024-08-16 14:53:22   2024-08-16 14:57:06  3.733333
## 5  2024-08-10 02:00:23   2024-08-10 02:22:08 21.750000
## 6  2024-08-14 08:27:33   2024-08-14 08:42:31 14.966667

Another parameter, ‘Duration’, was added by calculating the gap in minutes between pickup time and dropoff time for each row. Data points with a missing pickup and/or dropoff time, or with a negative ‘Duration’ value, were removed in the process.
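A minimal sketch of this step in base R is shown below. The toy `trips` data frame is hypothetical (its first two rows reuse timestamps from the output above), and parsing with `tz = "UTC"` is an assumption made only so that the recorded wall-clock times are compared as-is.

```r
# Toy data: two valid trips, one with dropoff before pickup, one with a missing pickup.
trips <- data.frame(
  tpep_pickup_datetime  = c("2024-08-10 16:40:15", "2024-08-01 13:21:49",
                            "2024-08-05 09:00:00", NA),
  tpep_dropoff_datetime = c("2024-08-10 16:58:01", "2024-08-01 13:37:04",
                            "2024-08-05 08:50:00", "2024-08-05 10:00:00"),
  stringsAsFactors = FALSE
)

fmt <- "%Y-%m-%d %H:%M:%S"
pickup  <- as.POSIXct(trips$tpep_pickup_datetime,  tz = "UTC", format = fmt)
dropoff <- as.POSIXct(trips$tpep_dropoff_datetime, tz = "UTC", format = fmt)

# Duration in minutes; NA whenever either timestamp is missing.
trips$Duration <- as.numeric(difftime(dropoff, pickup, units = "mins"))

# Keep rows with a known, non-negative duration.
clean <- trips[!is.na(trips$Duration) & trips$Duration >= 0, ]
```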

trip_distance, the total distance traveled on each trip in miles, is another parameter that played a crucial role in the data cleaning process, since many data points had abnormal values such as zero or extremely large numbers. Such data points were treated as outliers and removed, leaving 821,022 rows.
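The exact outlier rule is not stated here, so the sketch below shows one common choice purely as an assumption: drop non-positive distances, then drop values above the upper Tukey fence (third quartile plus 1.5 times the interquartile range). The toy `trips` data frame is hypothetical.

```r
# Toy data with one zero and one implausibly large distance (miles).
trips <- data.frame(trip_distance = c(0, 1.2, 2.5, 3.1, 0.8, 1.9, 2.2, 250))

# Step 1: remove zero or negative distances.
positive <- trips[trips$trip_distance > 0, , drop = FALSE]

# Step 2 (assumed rule): remove values above Q3 + 1.5 * IQR.
q <- quantile(positive$trip_distance, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
clean <- positive[positive$trip_distance <= upper_fence, , drop = FALSE]
```

A fixed cap in miles would be a simpler alternative; the fence has the advantage of adapting to the observed distribution.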

ggplot()+
  annotation_map_tile(type = "osm", zoom = 12) +
  geom_sf(data = zoneshp, aes(fill = borough))
## Zoom: 12

The taxi zone shapefile also provides useful information about the TLC’s taxi zone system. Using the ‘borough’ attribute in the data, we can potentially conduct borough-based analysis as well. It is also notable that the legend includes ‘EWR’ alongside the five boroughs of NYC, with only one taxi zone, ‘Newark Airport’, in this category. Newark Liberty International Airport is located in New Jersey rather than in NYC, and since October 2022 it has reportedly no longer been grouped under the NYC airport city code; still, it serves a notable volume of NYC residents’ travel because of its proximity to certain parts of the city. The TLC has seemingly acknowledged this and kept EWR as a taxi zone within its NYC system.

2-2. Exploratory Data Analysis

1) Pickup/Dropoff Counts by Zones

The total count of pickups and dropoffs in each zone is a simple yet effective way of understanding patterns of taxi demand.
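As a rough sketch of how such per-zone counts can be attached to the zone table, the base R example below counts pickups per `PULocationID` and left-joins the result onto a zone lookup. The toy `trips` and `zones` data frames are hypothetical, and plain data frames are used in place of the shapefile. Zones with no recorded pickups come out of the join as NA rather than 0.

```r
# Toy trips and a toy zone lookup (plain data frames, no geometry).
trips <- data.frame(PULocationID = c(161L, 161L, 237L, 100L, 161L, 237L))
zones <- data.frame(
  LocationID = c(100L, 103L, 161L, 237L),
  zone = c("Garment District",
           "Governor's Island/Ellis Island/Liberty Island",
           "Midtown Center",
           "Upper East Side South"),
  stringsAsFactors = FALSE
)

# Count pickups per zone ID.
pu_counts <- as.data.frame(table(LocationID = trips$PULocationID),
                           responseName = "PUCounts",
                           stringsAsFactors = FALSE)
pu_counts$LocationID <- as.integer(pu_counts$LocationID)

# Left join: every zone is kept; zones without pickups get NA.
zones_2 <- merge(zones, pu_counts, by = "LocationID", all.x = TRUE)

# Rank zones by pickup count (NA values sort last).
ranked <- zones_2[order(-zones_2$PUCounts), ]
```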

ggplot()+
  annotation_map_tile(type = "osm", zoom = 12) +
  geom_sf(data = zoneshp_2, aes(fill = PUCounts))+
  scale_fill_distiller(palette = "Spectral")+
  labs(title = "NYC Yellow Cab Pickup Counts by Taxi Zone",
       x = "Longitude", y = "Latitude") +
  theme_minimal()
## Zoom: 12

We can further retrieve the top 10 zones by pickup count for more detail:

top_n(zoneshp_2, 10, PUCounts)[,5:7]
## Simple feature collection with 10 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 981976.6 ymin: 206851.1 xmax: 998281.4 ymax: 226338.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
##                            zone   borough PUCounts
## 1           Lincoln Square East Manhattan    25633
## 2                Midtown Center Manhattan    43679
## 3                  Midtown East Manhattan    32666
## 4                   Murray Hill Manhattan    27706
## 5  Penn Station/Madison Sq West Manhattan    32872
## 6     Times Sq/Theatre District Manhattan    29769
## 7                      Union Sq Manhattan    25291
## 8         Upper East Side North Manhattan    32212
## 9         Upper East Side South Manhattan    38639
## 10                 East Chelsea Manhattan    25916
##                          geometry
## 1  MULTIPOLYGON (((989380.3 21...
## 2  MULTIPOLYGON (((991081 2144...
## 3  MULTIPOLYGON (((992224.4 21...
## 4  MULTIPOLYGON (((991999.3 21...
## 5  MULTIPOLYGON (((986752.6 21...
## 6  MULTIPOLYGON (((988786.9 21...
## 7  MULTIPOLYGON (((987029.8 20...
## 8  MULTIPOLYGON (((995940 2211...
## 9  MULTIPOLYGON (((993633.4 21...
## 10 MULTIPOLYGON (((983690.4 20...

Manhattan, especially the Midtown area, dominates in terms of pickup counts.

zoneshp_2 %>% filter(is.na(PUCounts))
## Simple feature collection with 20 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1049020 ymax: 230302.8
## Projected CRS: NAD83 / New York Long Island (ftUS)
## First 10 features:
##    LocationID OBJECTID Shape_Leng   Shape_Area
## 1         103      104 0.02122083 1.192053e-05
## 2         103      105 0.07742534 3.686364e-04
## 3         103      103 0.01430552 6.330564e-06
## 4         109      109 0.17826782 1.169601e-03
## 5         110      110 0.10394629 5.257451e-04
## 6         111      111 0.05993088 2.086833e-04
## 7         118      118 0.24396622 1.826939e-03
## 8         156      156 0.14447689 1.052122e-03
## 9         172      172 0.11847612 6.584025e-04
## 10        187      187 0.12686843 4.211958e-04
##                                             zone       borough PUCounts
## 1  Governor's Island/Ellis Island/Liberty Island     Manhattan       NA
## 2  Governor's Island/Ellis Island/Liberty Island     Manhattan       NA
## 3  Governor's Island/Ellis Island/Liberty Island     Manhattan       NA
## 4                                    Great Kills Staten Island       NA
## 5                               Great Kills Park Staten Island       NA
## 6                            Green-Wood Cemetery      Brooklyn       NA
## 7                    Heartland Village/Todt Hill Staten Island       NA
## 8                                Mariners Harbor Staten Island       NA
## 9                         New Dorp/Midland Beach Staten Island       NA
## 10                                 Port Richmond Staten Island       NA
##                          geometry
## 1  MULTIPOLYGON (((973172.7 19...
## 2  MULTIPOLYGON (((979605.8 19...
## 3  MULTIPOLYGON (((972079.6 19...
## 4  MULTIPOLYGON (((943392.6 14...
## 5  MULTIPOLYGON (((951420.1 13...
## 6  MULTIPOLYGON (((985590.4 17...
## 7  MULTIPOLYGON (((954167.8 16...
## 8  MULTIPOLYGON (((934327.5 17...
## 9  MULTIPOLYGON (((960204.8 14...
## 10 MULTIPOLYGON (((946964.1 17...

Several taxi zones appear in gray on the map. These have NA values in the ‘PUCounts’ column because no pickups from those locations were recorded in this particular dataset. Notably, ‘Governor’s Island/Ellis Island/Liberty Island’ (rows 1, 2, and 3) can be expected to always have zero pickups, since these areas can only be accessed by ferry.

ggplot()+
  annotation_map_tile(type = "osm", zoom = 12) +
  geom_sf(data = zoneshp_2, aes(fill = DOCounts))+
  scale_fill_distiller(palette = "Spectral")+
  labs(title = "NYC Yellow Cab Dropoff Counts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()
## Zoom: 12

## Simple feature collection with 10 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 981976.6 ymin: 208788.5 xmax: 998281.4 ymax: 226338.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
##                         zone   borough PUCounts DOCounts
## 1            Lenox Hill West Manhattan    20507    22965
## 2        Lincoln Square East Manhattan    25633    22808
## 3             Midtown Center Manhattan    43679    34938
## 4               Midtown East Manhattan    32666    27157
## 5                Murray Hill Manhattan    27706    28809
## 6  Times Sq/Theatre District Manhattan    29769    28543
## 7      Upper East Side North Manhattan    32212    33258
## 8      Upper East Side South Manhattan    38639    34043
## 9               Clinton East Manhattan    23813    23028
## 10              East Chelsea Manhattan    25916    24247
##                          geometry
## 1  MULTIPOLYGON (((994839.1 21...
## 2  MULTIPOLYGON (((989380.3 21...
## 3  MULTIPOLYGON (((991081 2144...
## 4  MULTIPOLYGON (((992224.4 21...
## 5  MULTIPOLYGON (((991999.3 21...
## 6  MULTIPOLYGON (((988786.9 21...
## 7  MULTIPOLYGON (((995940 2211...
## 8  MULTIPOLYGON (((993633.4 21...
## 9  MULTIPOLYGON (((986694.3 21...
## 10 MULTIPOLYGON (((983690.4 20...

Dropoffs are also heavily concentrated in central Manhattan.

zoneshp_2 %>% filter(is.na(DOCounts))
## Simple feature collection with 19 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1049020 ymax: 246613
## Projected CRS: NAD83 / New York Long Island (ftUS)
## First 10 features:
##    LocationID OBJECTID Shape_Leng   Shape_Area
## 1         103      104 0.02122083 1.192053e-05
## 2         103      103 0.01430552 6.330564e-06
## 3         103      105 0.07742534 3.686364e-04
## 4         109      109 0.17826782 1.169601e-03
## 5         110      110 0.10394629 5.257451e-04
## 6         118      118 0.24396622 1.826939e-03
## 7         156      156 0.14447689 1.052122e-03
## 8         176      176 0.15199519 6.577821e-04
## 9         187      187 0.12686843 4.211958e-04
## 10        199      199 0.07780850 2.887475e-04
##                                             zone       borough PUCounts
## 1  Governor's Island/Ellis Island/Liberty Island     Manhattan       NA
## 2  Governor's Island/Ellis Island/Liberty Island     Manhattan       NA
## 3  Governor's Island/Ellis Island/Liberty Island     Manhattan       NA
## 4                                    Great Kills Staten Island       NA
## 5                               Great Kills Park Staten Island       NA
## 6                    Heartland Village/Todt Hill Staten Island       NA
## 7                                Mariners Harbor Staten Island       NA
## 8                                        Oakwood Staten Island        1
## 9                                  Port Richmond Staten Island       NA
## 10                                 Rikers Island         Bronx       NA
##    DOCounts                       geometry
## 1        NA MULTIPOLYGON (((973172.7 19...
## 2        NA MULTIPOLYGON (((972079.6 19...
## 3        NA MULTIPOLYGON (((979605.8 19...
## 4        NA MULTIPOLYGON (((943392.6 14...
## 5        NA MULTIPOLYGON (((951420.1 13...
## 6        NA MULTIPOLYGON (((954167.8 16...
## 7        NA MULTIPOLYGON (((934327.5 17...
## 8        NA MULTIPOLYGON (((950393.9 14...
## 9        NA MULTIPOLYGON (((946964.1 17...
## 10       NA MULTIPOLYGON (((1015024 230...

Aside from the aforementioned three islands, some zones have no dropoff records (for example, Oakwood, which has one pickup but no dropoffs), and others have neither pickup nor dropoff records.

2) Trip Duration (scrapped)

This part was not included in the final report. While the ‘Duration’ variable was crucial in the data cleaning process, the team decided that further analysis of this variable was not relevant to the goal of this research.

summary(data$Duration)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.03333  7.01667 11.01667 12.35400 16.35000 40.23333
ggplot(data, aes(x= "Trips", y=Duration))+
  geom_boxplot(notch=TRUE, fill = "yellow", alpha = 0.3, size = 1.2)+
  labs(title = "Boxplot of Yellow Taxi Trip Durations")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

#version 1
ggplot(data, aes(x=Duration))+
  geom_histogram(color="darkblue", fill="lightblue")+
  labs(title = "Distribution of Yellow Taxi Trip Duration")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the summary and plots, we can observe that yellow taxi rides average about 12 minutes, with a median of about 11 minutes. Half of all rides last between roughly 7 and 16 minutes, and relatively few last 20 minutes or longer.

3) Distribution of pickup counts by time

ggplot(data, aes(x=as.Date(tpep_pickup_datetime)))+
  geom_bar(color="darkred", fill="red", alpha = 0.3)+
  labs(title = "Daily Yellow Taxi Pickup Counts of August 2024")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data, aes(x=as.Date(tpep_pickup_datetime)))+
  geom_histogram(color="darkred", fill="red", alpha = 0.3)+
  labs(title = "Daily Yellow Taxi Pickup Counts of August 2024")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

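The bar plot and the histogram built from the same data look different. One likely reason is binning: as the message above indicates, the histogram defaults to 30 bins, while August has 31 days, so the bins cannot line up one-to-one with calendar days. It may also be worth checking whether the ‘tpep_pickup_datetime’ variable from the original data is stored in a proper date-time format.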
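One thing worth checking is how 31 calendar days behave when split into the 30 bins reported in the `stat_bin()` message. The base R sketch below uses made-up data with exactly 100 pickups on each day of August 2024. It relies on `cut()` rather than ggplot2's own binning rule, so it only illustrates the general effect: with 31 distinct days and 30 equal-width bins, at least one bin has to absorb two days.

```r
# Hypothetical data: exactly 100 pickups on each of the 31 days of August 2024.
days <- seq(as.Date("2024-08-01"), as.Date("2024-08-31"), by = "day")
pickups <- rep(days, each = 100)

# One count per calendar day (what a bar plot of dates shows).
per_day <- table(pickups)

# The same data cut into 30 equal-width bins (roughly what a 30-bin histogram does).
per_bin <- table(cut(as.numeric(pickups), breaks = 30))

range(per_day)  # every day has the same count
range(per_bin)  # at least one bin holds two days' worth of pickups
```

In ggplot2, setting `binwidth = 1` in `geom_histogram()` asks for one bin per day, which should bring the histogram much closer to the bar plot.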

4) Next steps

Since our analysis of taxi demand considers both space and time, the time-related variables in the dataset need further processing. An overview of pickup counts by time period, mainly hour of day and day of week, would be necessary to better reflect the goal of this project.
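A possible starting point for that processing is sketched below in base R. The three timestamps are taken from the output shown earlier; the variable names and the use of `tz = "UTC"` are assumptions. Parsing and formatting in the same time zone simply returns the recorded wall-clock hour, and `"%u"` gives the ISO weekday number (1 = Monday, 7 = Sunday), which avoids locale-dependent day names.

```r
pickup <- c("2024-08-10 16:40:15", "2024-08-01 13:21:49", "2024-08-14 08:27:33")

# Parse the character timestamps into date-time values.
ts <- as.POSIXct(pickup, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")

# Hour of day (0-23) and ISO weekday (1 = Monday ... 7 = Sunday).
pickup_hour <- as.integer(format(ts, "%H"))
pickup_wday <- as.integer(format(ts, "%u"))

# Pickup counts by hour and by weekday.
by_hour <- table(pickup_hour)
by_wday <- table(pickup_wday)
```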